Another Look at the Physics of Large Deviations 
With Application to Rate-Distortion Theory 

Neri Merhav 




Abstract — We revisit and extend the physical interpretation 
recently given to a certain identity between large-deviations rate- 
functions (as well as applications of this identity to Information 
Theory), as an instance of thermal equilibrium between several 
physical systems that are brought into contact. Our new inter- 
pretation, of mechanical equilibrium between these systems, is 
shown to have several advantages relative to that of thermal 
equilibrium. This physical point of view also provides a trigger 
to the development of certain alternative representations of the 
rate-distortion function and channel capacity, which are new to 
the best knowledge of the author. 

Index Terms — Large deviations theory, Chernoff bound, statistical
physics, free energy, mechanical equilibrium, rate-distortion
theory.



I. Introduction 

RELATIONSHIPS between information theory and sta- 
tistical physics have been widely recognized in the last 
few decades, from a wide spectrum of aspects. These include 
conceptual aspects, of parallelisms and analogies between the- 
oretical principles in the two disciplines, as well as technical 
aspects, of mapping between mathematical formalisms in both 
fields and borrowing analysis techniques from one field to 
the other. One example of such a mapping is between the
paradigm of random codes for channel coding and certain
models of magnetic materials, most notably, Ising models
and spin glass models (cf., e.g., [10] and many references
therein). Today, it is quite widely believed that research in the
intersection between information theory and statistical physics 
may have the potential of fertilizing both disciplines. 

This paper is more related to the former aspect mentioned
above, namely, the relationships between the two areas at the
conceptual level. In particular, we revisit results of a recent
work [9] and propose a somewhat different perspective, which,
we believe, has certain advantages that will be explained
and demonstrated in the sequel.

More specifically, in [9], an identity between two forms of
the rate function of a certain large deviations event was estab- 
lished, with several applications in information theory. Inspired 
by a few earlier works (cf. e.g., 0, HD, QH), this identity 
was interpreted as thermal equilibrium between several many- 
particle physical systems that are brought in contact. In partic- 
ular, the parameter that undergoes optimization of the Chernoff 
bound, henceforth referred to as the Chernoff parameter, plays 
a role that is intimately related to the equilibrium temperature: 
in fact, it is the reciprocal of the temperature, called the inverse
temperature. The corresponding large deviations rate function
is then identified with the entropy of the system. 

While this physical interpretation is fairly reasonable, it 
turns out, as we show in this paper, that it leaves quite some 
room for improvement, and we will mention here just two 
points. The first is that this interpretation does not generalize
to rate functions of combinations of two or more rare events,
where the number of Chernoff parameters equals the number
of events. This is because there is only one temperature
parameter in physics. The other point, which is on a more
technical level, is the following (more details and clarifica- 
tions will follow in Subsection 2B below): while the log- 
moment generating function, pertaining to the large deviations 
rate function, naturally includes weighting by probabilities, 
its physical analogue, which is the partition function, does 
not. If these probabilities are subjected to optimization (e.g., 
optimization of random coding distributions), they may depend 
on the Chernoff parameter, i.e., on the temperature, in a rather 
complicated manner, and then the resulting expression can no 
longer really be viewed as a partition function. 

In this paper, we propose to interpret the above-mentioned 
identity of rate functions as an instance of mechanical equi- 
librium (i.e., balance between mechanical forces), rather than 
thermal equilibrium, and then the Chernoff parameter plays 
the physical role of an external force, or field, applied to 
the physical system in consideration. In this paradigm, the 
large deviations rate function has a natural interpretation as the 
(Helmholtz) free energy of the system, rather than as entropy.
Accordingly, since the rate-distortion function (and similarly, 
also channel capacity) can be thought of as a large deviations 
rate function, it can also be interpreted as the free energy of 
a certain system. 

This interpretation has several advantages. First, it is consis- 
tent with the analogy between the free energy in physics and 
the Kullback-Leibler divergence in information theory (see,
e.g., [1], [11]), which is well known to play a role as a rate
function when the large deviations analysis is approached by
the method of types [4]. Second, it is free of the limitations
mentioned in the previous paragraph, as we will see in the 
sequel. Third, it serves as a trigger to develop certain repre- 
sentations of the rate-distortion function (and analogously, the 
channel capacity), which are new to the best knowledge of the 
author. 

Since the rate-distortion function can be thought of as 
free energy, as mentioned above, one of the representations 
of the rate-distortion function expresses it as (the minimum 
achievable) mechanical work carried out by the aforemen- 
tioned external force, along a 'distance' that is measured in 
terms of the distortion. Another representation, which follows
from the first one, is as an integral that involves the single-letter
minimum mean square error (MMSE) in estimating the
distortion given the source symbol, according to a certain 
joint distribution of these two random variables. The latter 
representation may suggest a new route to the derivation of 
upper and lower bounds on the rate-distortion function and 
channel capacity, using the plethora of upper and lower bounds 
on MMSE, available from estimation theory. In particular, for 
upper bounds, one may examine the mean squared error of 
an arbitrary estimator, e.g., the best linear estimator. Lower 
bounds, like the Bayesian Cramér-Rao bound and numerous
others are available in the literature (cf. e.g., lfT3l .l 161 and 
references therein). We have not explored these directions, 
however, in the framework of the work presented herein. 

An additional byproduct of the proposed perspective is
the following: Given a source distribution and a distortion
measure, we can describe (at least conceptually) a concrete
physical system that emulates the rate-distortion problem in
the following manner (see Fig. 1): When no force is applied to
the system, its total length is $n\Delta_0$, where $n$ is the number of
particles in the system (and also the block length in the
rate-distortion problem), and $\Delta_0$ is the distortion corresponding
to zero coding rate. If one applies to the system a contracting
force that increases from zero to some final force $\lambda$, such
that the length of the system shrinks to $n\Delta$, where $\Delta < \Delta_0$
is analogous to a prescribed distortion level, then the following
two facts hold true: (i) An achievable lower bound on the total
amount of mechanical work that must be carried out by the
contracting force in order to shrink the system to length $n\Delta$
is given by

$$W \ge nkTR(\Delta),$$

where $k$ is Boltzmann's constant, $T$ is the temperature, and
$R(\Delta)$ is the rate-distortion function. (ii) The final force $\lambda$ is
related to $\Delta$ according to $\lambda = kTR'(\Delta)$, where $R'(\cdot)$ is the
derivative of $R(\cdot)$.

Thus, we observe that $R(\Delta)$ plays the role of a fundamental
limit, not only in information theory, but also in physics.

[Figure 1: a system of unloaded length $n\Delta_0$, shortened to length $n\Delta$ by a contracting force $\lambda = kTR'(\Delta)$.]

Fig. 1. Emulation of $R(\Delta)$ by a physical system.



The outline of the paper is as follows. In Section 2, we 
provide some background in physics (Subsection 2A) and 
give a brief description of the physical interpretation proposed 
in [9] (Subsection 2B). Then, we develop the new proposed
physical interpretation, first for a generic large deviations 
rate-function (Section 3), and then, in the context of the 
rate-distortion problem (Section 4). In Section 5, we present 
the above mentioned alternative representations of the rate- 
distortion function. Finally, in Section 6, we summarize this 
work and conclude. 

II. Preliminaries 

A. Physics Background 

Consider a physical system with a large number n of 
particles, which can be in a variety of microscopic states 
('microstates'), defined by combinations of, e.g., positions, 
momenta, angular momenta, spins, etc., of all n particles. 
For each such microstate of the system, which we shall
designate by a vector $x = (x_1, \ldots, x_n)$, there is an associated
energy, given by a Hamiltonian (energy function), $\mathcal{E}(x)$. For
example, if $x_i = (p_i, h_i)$, where $p_i$ is the momentum vector
of particle number $i$ and $h_i$ is its height, then classically,

$$\mathcal{E}(x) = \sum_{i=1}^n \left( \frac{\|p_i\|^2}{2m} + mgh_i \right),$$

where $m$ is the mass of each particle and $g$ is the gravitational
acceleration.

One of the most fundamental results in statistical physics 
(based on the law of energy conservation and the basic 
postulate that all microstates of the same energy level are 
equiprobable) is that when the system is in thermal equilibrium 
with its environment, the probability of a microstate x is given 
by the Boltzmann-Gibbs distribution 

$$P(x) = \frac{e^{-\beta \mathcal{E}(x)}}{Z_n(\beta)}, \qquad (1)$$

where $\beta = 1/(kT)$, $T$ being temperature, $k$ being Boltzmann's
constant, and $Z_n(\beta)$ is the normalization constant, called the
partition function, which is given by

$$Z_n(\beta) = \sum_x e^{-\beta \mathcal{E}(x)}$$

or

$$Z_n(\beta) = \int dx\, e^{-\beta \mathcal{E}(x)},$$

depending on whether $x$ is discrete or continuous. The role
of the partition function is far deeper than just being a
normalization factor, as it is actually the key quantity from
which many macroscopic physical quantities can be derived.
For example, the Helmholtz free energy (see footnote no. 1) is
$-\frac{1}{\beta} \ln Z_n(\beta)$, the
average internal energy (i.e., the expectation of $\mathcal{E}(x)$, where
$x$ is drawn according to (1)) is given by the negative derivative

1 The physical meaning of the Helmholtz free energy is the following: The
difference between the Helmholtz free energies of two equilibrium states is
the minimum work that should be done on the system in any process of fixed
temperature (isothermal process) in the passage between these two states.
The minimum is obtained when the process is reversible (slow, quasi-static
changes in the system).

of $\ln Z_n(\beta)$ with respect to $\beta$, the heat capacity is obtained from the second
derivative, etc. One of the ways to obtain eq. (1) is as
the maximum entropy distribution under an energy constraint
(owing to the second law of thermodynamics), where $\beta$ plays
the role of a Lagrange multiplier that controls this energy level.

Under certain assumptions on the Hamiltonian function,
the following relations are well known to hold and can
be found in any textbook on elementary statistical physics
(see, e.g., [2], [7], [10]): Defining the per-particle entropy,
$S(E)$, associated with per-particle energy $E = \mathcal{E}(x)/n$, as
$\lim_{n\to\infty} [\ln \Omega(E)]/n$ (see footnote no. 2; provided that the limit exists), where
$\Omega(E)$ is the number of microstates $\{x\}$ with energy level
$\mathcal{E}(x) = nE$, then, similarly as in the method of types, one
can evaluate $Z_n(\beta)$ defined above, as

$$Z_n(\beta) = \sum_E \Omega(E) e^{-\beta n E}$$

(in the discrete case), which is of the exponential order of

$$\exp\{n \max_E [S(E) - \beta E]\}.$$

Defining

$$\phi(\beta) = \lim_{n\to\infty} \frac{\ln Z_n(\beta)}{n},$$

and the Helmholtz free energy per particle as

$$F = -\frac{\phi(\beta)}{\beta},$$

we obtain the Legendre relation

$$\phi(\beta) = \max_E [S(E) - \beta E],$$

where the maximizer of $[S(E) - \beta E]$ is denoted by $E = E(\beta)$.
For a given $\beta$, the Boltzmann-Gibbs distribution has a sharp
peak (for large $n$) at the level of $E(\beta)$ Joules per particle.
Assuming that $S(\cdot)$ is concave (which is normally the case),
the above Legendre relation can be inverted to obtain

$$S(E) = \min_\beta [\beta E + \phi(\beta)],$$

and both relations can be identified with the thermodynamical
definition of the Helmholtz free energy as

$$F = E - TS.$$

In the latter relation, the minimizing $\beta = \beta(E)$ (the inverse
function of $E(\beta)$) is the equilibrium inverse temperature
associated with energy level $E$. The second law of thermodynamics
asserts that in an isolated system (which does not exchange en- 
ergy with its environment), the total entropy cannot decrease, 
and hence in equilibrium, it reaches its maximum. When the 
system is allowed to exchange heat with the environment 
(at constant volume and temperature), this maximum entropy 
principle is replaced by the minimum free energy principle: 
The Helmholtz free energy cannot increase, and it reaches its 
minimum in equilibrium. 

2 Actually, the definition should also include a factor of k, which we will 
omit in this discussion, thus considering S(E) as the per-particle entropy in 
units of k. 



When the Hamiltonian is additive, that is,

$$\mathcal{E}(x) = \sum_{i=1}^n \mathcal{E}(x_i),$$

then $P(x)$ has a product form (the particles do not interact),
and then the above-mentioned physical quantities per particle
can be extracted from the case $n = 1$. In this additive case,
the Legendre transform that takes $\phi(\beta)$ to $S(E)$ is similar
to the Legendre transform that defines the rate function (the
exponent of the Chernoff bound) pertaining to the probability
of the event

$$\sum_{i=1}^n \mathcal{E}(x_i) \le nE,$$

thus the parameter to be optimized in the Chernoff bound plays
the role of inverse temperature in the corresponding
statistical-mechanical system.
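
The Legendre pair above can be checked numerically. The following is a minimal illustrative sketch, not part of the original analysis: it assumes non-interacting two-level particles with single-particle energies 0 and 1, for which $\phi(\beta) = \ln(1 + e^{-\beta})$ and $S(E)$ is the binary entropy in nats. The function names are hypothetical.

```python
import math

def phi(beta):
    """Per-particle log-partition function of a two-level particle (energies 0 and 1)."""
    return math.log(1.0 + math.exp(-beta))

def entropy_via_legendre(E, lo=-50.0, hi=50.0, iters=200):
    """Evaluate S(E) = min over beta of [beta*E + phi(beta)].

    The derivative of the bracket, E - 1/(1 + e^beta), is increasing in beta,
    so its unique root is located by bisection.
    """
    def slope(beta):
        return E - 1.0 / (1.0 + math.exp(beta))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0.0:
            lo = mid          # root lies to the right of mid
        else:
            hi = mid          # root lies to the left of mid
    beta = 0.5 * (lo + hi)
    return beta * E + phi(beta), beta

def binary_entropy(E):
    """Entropy (in nats) of a fraction E of particles in the upper level."""
    return -E * math.log(E) - (1.0 - E) * math.log(1.0 - E)
```

For instance, `entropy_via_legendre(0.2)` should return the binary entropy of 0.2 together with the minimizing inverse temperature $\beta = \ln(0.8/0.2)$, which is the equilibrium inverse temperature of this toy system at energy level $E = 0.2$.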

Another look at this correspondence between large devia- 
tions rate functions and thermal equilibrium is the following: 
If $P$ is the above-mentioned Boltzmann-Gibbs distribution and
$Q$ is another probability distribution on the microstates $\{x\}$,
then, as is shown, e.g., in [1], the Kullback-Leibler divergence
between $Q$ and $P$ is given by

$$D(Q\|P) = \beta (F_Q - F_P),$$

where $F_P$ and $F_Q$ are, respectively, the Helmholtz free energies
pertaining to $P$ and $Q$. The rate function pertaining to
a large deviations event is normally given by the minimum 
divergence under the constraints corresponding to this event 
(see, e.g., Chap. 11]), and so, it is equivalent to minimum 
free energy, i.e., thermal equilibrium by the second law. 
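
As a quick sanity check of the relation between divergence and free energy, the sketch below evaluates both sides on a small discrete system. It is an illustration only: the energy levels, the temperature and the competing distribution `Q` are hypothetical values, and the free energy of a general distribution is taken to be $F_Q = \langle \mathcal{E} \rangle_Q - kT \cdot H(Q)$, with $H$ the entropy in nats.

```python
import math

def boltzmann(energies, kT):
    """Boltzmann-Gibbs distribution over a finite set of energy levels."""
    weights = [math.exp(-e / kT) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

def free_energy(Q, energies, kT):
    """Helmholtz free energy <E>_Q - kT * H(Q), with H the entropy in nats."""
    avg_energy = sum(q * e for q, e in zip(Q, energies))
    entropy = -sum(q * math.log(q) for q in Q if q > 0.0)
    return avg_energy - kT * entropy

def kl_divergence(Q, P):
    """Kullback-Leibler divergence D(Q||P) in nats."""
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0.0)

energies = [0.0, 1.0, 2.5]   # hypothetical energy levels
kT = 0.7                     # hypothetical temperature, in energy units
P = boltzmann(energies, kT)
Q = [0.5, 0.3, 0.2]          # an arbitrary competing distribution

lhs = kl_divergence(Q, P)
rhs = (free_energy(Q, energies, kT) - free_energy(P, energies, kT)) / kT
```

Under these assumptions `lhs` and `rhs` agree to machine precision, and both are non-negative, which is the minimum free energy principle in this finite setting.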

Consider next a system of $n$ non-interacting particles as
before, except that now the Hamiltonian is shifted by a
quantity that is proportional to some parameter $\lambda$, i.e., the
Hamiltonian is redefined as

$$\mathcal{E}(x, y) = \mathcal{E}_0(x) - \lambda \sum_{i=1}^n y_i,$$

where we have changed the notation of the (original)
Hamiltonian to $\mathcal{E}_0(x)$, and where $\{y_i\}$ are some additional
variables used to describe the microstate. These new variables
may either be dependent on, or independent of, the original
microstate variables $\{x_i\}$ (both cases are demonstrated
in Example 1 below) and their number, $n$, is here taken
to be the same as the number of $\{x_i\}$, primarily for
reasons of convenience (see footnote no. 3). The parameter $\lambda$ is thought of
as an external control parameter, i.e., a driving force (or
a field) that acts on the system via the state variables
$\{y_i\}$. The parameter $\lambda$ can be a mechanical force (e.g.,
pressure, elastic extraction/contraction force, gravitational
force), an electric field (acting on a charged particle or
an electric dipole), a magnetic field (acting on a magnet
or spin), or even a chemical driving force (chemical potential).

3 In general, their number can be different, but then it is still assumed to
grow proportionally to $n$.

Example 1 (may be skipped without loss of continuity).
Consider the following two systems. The first is the same
example as in the first paragraph of this subsection, namely,
non-interacting particles in motion under gravitation. The
Hamiltonian,

$$\mathcal{E}(x) = \sum_i \left( \frac{\|p_i\|^2}{2m} + mgh_i \right),$$

can be thought of as being composed of the 'original' Hamiltonian
$\sum_i \|p_i\|^2/(2m)$ (with $\{p_i\}$ replacing $\{x_i\}$), and the
'shifting' term, $mg \sum_i h_i$, whose force parameter is $\lambda = -mg$
(gravitational force), acting on the height variables $y_i = h_i$. In
this example, the variables $x = p$ and $y = h = (h_1, \ldots, h_n)$
are independent. The second system consists of $n$ one-dimensional
harmonic oscillators (e.g., springs or pendulums),
where the Hamiltonian is

$$\mathcal{E}_0(x) = \sum_i \left( \frac{p_i^2}{2m} + \frac{K y_i^2}{2} \right),$$

$p_i$ being the (one-dimensional) momentum, $y_i$ the displacement
of each oscillator from its equilibrium position, and $K$
the elasticity constant. Now, suppose that an external force
$\lambda$ is applied to each spring, so the Hamiltonian becomes

$$\mathcal{E}(x, y) = \sum_i \left( \frac{p_i^2}{2m} + \frac{K y_i^2}{2} - \lambda y_i \right).$$

In this case, the variables of the original Hamiltonian, $x_i =
(p_i, y_i)$, contain the variables $\{y_i\}$ of the shifting term as a
subset. We also see that the modified Hamiltonian is, within
an immaterial additive constant, identical to

$$\sum_i \left[ \frac{p_i^2}{2m} + \frac{K}{2} \left( y_i - \frac{\lambda}{K} \right)^2 \right].$$

This means that the force $\lambda$ shifts the common mean of the
RV's $\{y_i\}$, which is the equilibrium point of all oscillators, by
$\Delta y = \lambda/K$, as expected. This concludes Example 1. □
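
The shift of the equilibrium point in the second system of Example 1 can also be seen numerically. The sketch below is an illustration under assumed parameter values (the values of `K`, `lam` and `kT` are hypothetical): it averages the displacement $y$ of one oscillator under the Boltzmann weight of the shifted Hamiltonian by a plain Riemann sum. The momentum is independent of $y$ and integrates out, so it is omitted.

```python
import math

def mean_displacement(K, lam, kT, y_max=10.0, steps=20001):
    """Average of y under the weight exp(-(K*y^2/2 - lam*y)/kT), via a Riemann sum on [-y_max, y_max]."""
    dy = 2.0 * y_max / (steps - 1)
    num = 0.0
    den = 0.0
    for i in range(steps):
        y = -y_max + i * dy
        w = math.exp(-(0.5 * K * y * y - lam * y) / kT)
        num += y * w
        den += w
    return num / den

K, lam, kT = 2.0, 0.6, 0.5          # hypothetical spring constant, force, temperature
shift = mean_displacement(K, lam, kT)
```

With these values `shift` should come out very close to $\lambda/K = 0.3$, independently of the temperature, in agreement with the completed-square form of the Hamiltonian.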
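
Consider next the partition function

$$Z_n(\beta, \lambda) = \sum_{x,y} e^{-\beta \mathcal{E}(x,y)}.$$

The Gibbs free energy (see footnote no. 4) per particle is defined as

$$G_n(\beta, \lambda) = -\frac{kT}{n} \ln Z_n(\beta, \lambda),$$

and the asymptotic Gibbs free energy per particle is

$$G(\beta, \lambda) = \lim_{n\to\infty} G_n(\beta, \lambda).$$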

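What is the relation between the Helmholtz free
energy and the Gibbs free energy? Let $\Omega(E, Y) \doteq e^{nS(E,Y)}$
denote the number of microstates $\{(x, y)\}$ for which

$$\sum_i \mathcal{E}_0(x_i) = nE \quad \text{and} \quad \sum_i y_i = nY.$$

4 The Gibbs free energy has a meaning similar to that of the Helmholtz
free energy (see footnote no. 1), but it refers to partial work: the difference
between the Gibbs free energies of two equilibrium points is the minimum
amount of work to be done on the system, other than work pertaining to
changes in the variables $\{y_i\}$, in an isothermal process with fixed $\lambda$, in the
passage between these two points.

Then, defining the partial partition function

$$Z_n(\beta, Y) = \sum_{\{(x,y):\ \sum_i y_i = nY\}} e^{-\beta \mathcal{E}_0(x)},$$

the normalized Helmholtz free energy for a given $Y$,

$$F_n(\beta, Y) = -\frac{kT}{n} \ln Z_n(\beta, Y),$$

and the corresponding asymptotic normalized Helmholtz free
energy,

$$F(\beta, Y) = \lim_{n\to\infty} F_n(\beta, Y),$$

we have (similarly as in the method of types):

$$\begin{aligned}
e^{-\beta n G_n(\beta,\lambda)} &= \sum_{x,y} e^{-\beta[\mathcal{E}_0(x) - \lambda \sum_i y_i]} \\
&= \sum_{E,Y} \Omega(E,Y)\, e^{-\beta n (E - \lambda Y)} \\
&\doteq \sum_{E,Y} e^{n[S(E,Y) - \beta(E - \lambda Y)]} \\
&= \sum_Y e^{n\beta\lambda Y} \sum_E e^{n[S(E,Y) - \beta E]} \\
&\doteq \sum_Y e^{n\beta\lambda Y} Z_n(\beta, Y) \\
&= \sum_Y e^{n\beta\lambda Y} \cdot e^{-\beta n F_n(\beta,Y)} \\
&\doteq \exp\{n\beta \cdot \max_Y [\lambda Y - F(\beta, Y)]\}, \qquad (2)
\end{aligned}$$

where $\doteq$ denotes asymptotic equivalence in the exponential
scale (see footnote no. 5). This results in the Legendre relation

$$G(\beta, \lambda) = \min_Y [F(\beta, Y) - \lambda Y].$$

Assuming that $F(\beta, Y)$ is convex in $Y$ for fixed $\beta$, the inverse
Legendre relation is

$$\begin{aligned}
F(\beta, Y) &= \max_\lambda [G(\beta, \lambda) + \lambda Y] \\
&= \max_\lambda \left[ \lambda Y - kT \times \lim_{n\to\infty} \frac{1}{n} \ln \left( \sum_{x,y} e^{-\beta[\mathcal{E}_0(x) - \lambda \sum_i y_i]} \right) \right] \\
&= kT \cdot \max_\lambda \left[ \beta\lambda Y - \lim_{n\to\infty} \frac{1}{n} \ln \left( \sum_{x,y} e^{-\beta \mathcal{E}_0(x)} \cdot e^{\beta\lambda \sum_i y_i} \right) \right] \\
&= kT \cdot \max_s \left[ sY - \lim_{n\to\infty} \frac{1}{n} \ln \left( \sum_{x,y} e^{-\beta \mathcal{E}_0(x)} \cdot e^{s \sum_i y_i} \right) \right], \qquad (3)
\end{aligned}$$

where in the last step, we changed the optimization variable
$\lambda$ to $s = \beta\lambda$ for fixed $\beta$. Since $s$ is proportional to $\lambda$ for fixed
$\beta$, and $\lambda$ designates force, we will henceforth refer to $s$ also
as 'force' (although its physical units are different). We will
get back to eq. (3) soon.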
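
The exponential-scale equivalence behind eq. (2) can be illustrated on a toy model. The sketch below is not taken from the paper: it assumes $n$ particles with $y_i \in \{0, 1\}$ and $\mathcal{E}_0 \equiv 0$, so that $Z_n(\beta, Y)$ is a binomial coefficient, $-\beta G_n = \ln(1 + e^{s})$ with $s = \beta\lambda$, and the sum over $Y$ has only $n + 1$ terms. It compares the normalized logarithm of the full sum with that of its largest term.

```python
import math

def log_binom(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def exact_and_max_term(n, s):
    """Return (1/n) ln sum_k C(n,k) e^{s k} and (1/n) ln of the largest term of that sum."""
    log_terms = [log_binom(n, k) + s * k for k in range(n + 1)]
    top = max(log_terms)
    # log-sum-exp, shifted by the largest term for numerical stability
    log_sum = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return log_sum / n, top / n

n, s = 2000, 0.8                       # hypothetical block length and scaled force
exact, largest = exact_and_max_term(n, s)
gap = exact - largest                  # vanishes like (ln n)/n
```

Since the sum contains $n + 1$ non-negative terms, `gap` is guaranteed to lie between 0 and $\ln(n+1)/n$, which is exactly the sense in which the sum and its maximal term are equivalent in the exponential scale.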



5 More precisely, $a_n \doteq b_n$, for two positive sequences $\{a_n\}$ and $\{b_n\}$,
means that $\frac{1}{n} \log \frac{a_n}{b_n} \to 0$ as $n \to \infty$.

B. A Brief Summary of [9]

First, recall that in the previous subsection, we mentioned
that the Legendre relation

$$S(E) = \min_\beta [\beta E + \phi(\beta)]$$

is similar to the rate function of the large deviations event
$\{\sum_i \mathcal{E}(x_i) = nE\}$ for i.i.d. RV's $\{x_i\}$, governed by a given
distribution $P$. The difference is that in the latter, the
log-moment generating function

$$\ln \sum_x P(x) e^{-\beta \mathcal{E}(x)},$$

that undergoes the Legendre transform, contains weighting by
the probabilities $\{P(x)\}$, unlike the log-partition function

$$\ln \sum_x e^{-\beta \mathcal{E}(x)},$$

which does not. In [9] it was proposed to interpret the weights
$\{P(x)\}$ as being proportional to a factor of the multiplicity of
states $\{x\}$ having the same energy $\mathcal{E}(x)$, i.e., as the degeneracy
in the physics terminology (see footnote no. 6).

When considering applications of large deviations theory
to information theory, one can view the rate-distortion function
(and analogously, also channel capacity) as the large-deviations
rate function of the event $\{\sum_{i=1}^n d(x_i, \hat{x}_i) \le n\Delta\}$,
where $x = (x_1, \ldots, x_n)$ is a given typical source sequence
(i.e., its empirical distribution agrees with the source $P$)
and $\{\hat{x}_i\}$ are i.i.d. RV's drawn by a certain random coding
distribution $Q$. As was observed in [9], there are two ways
to express the large deviations rate function of this event,
which is also the rate-distortion function, $R_Q(\Delta)$, for the
given random coding distribution $Q$: The first is by considering all
distortion variables $\{d(x_i, \hat{x}_i)\}$ together, on the same footing,
resulting in the expression

$$I(\Delta) = -\min_{\beta \ge 0} \left[ \beta\Delta + \sum_x P(x) \ln \sum_{\hat{x}} Q(\hat{x}) e^{-\beta d(x,\hat{x})} \right],$$

which can also be obtained (see, e.g., [6, p. 90, Corollary
4.2.3]) using different considerations. The second way is to
separate the distortion contributions, $\{\Delta_x\}$, allocated to the
various source letters $\{x\}$, which results in

$$I(\Delta) = -\max_{\{\Delta_x\}:\ \sum_x P(x)\Delta_x \le \Delta}\ \sum_x P(x) \min_{\beta_x \ge 0} \left[ \beta_x \Delta_x + \ln \sum_{\hat{x}} Q(\hat{x}) e^{-\beta_x d(x,\hat{x})} \right].$$

The identity between these two expressions, as was proved in
[9], means that the outer maximum in the second expression
(maximum entropy) is achieved when $\{\Delta_x\}$ are allocated in
such a way that the minimizing temperature parameters $\{\beta_x\}$
are all the same, namely, thermal equilibrium between all
subsystems indexed by $x$. Once again, $\{Q(\hat{x})\}$ can be interpreted
as degeneracy, which is fine as long as $Q$ is fixed. However,
the real rate-distortion function, $R(\Delta) = \min_Q R_Q(\Delta)$, is
obtained by optimization (of either expression) over $Q$ and
the optimum $Q$ may, in general, depend on $\beta$ (or equivalently,
on $\Delta$). In this situation, $Q$ can no longer be given the meaning
of degeneracy, because in physics, degeneracy has nothing to
do with temperature.

6 Another approach, proposed in [13], was to absorb $P(x)$ as part of the
Hamiltonian, but then the Hamiltonian becomes temperature-dependent, which
does not comply with the common paradigm in statistical mechanics.
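
The identity between the joint and the separated Chernoff expressions can be probed numerically. The sketch below is an illustration under assumed values, not the construction of [9]: it takes a binary source, a binary reproduction alphabet with Hamming distortion, and hypothetical distributions `P` and `q`; it treats the allocation constraint with equality, and it uses a hand-written golden-section search (`golden_min`) for all one-dimensional optimizations. Both functions return the quantity inside the leading minus sign, so the rate is the negative of the returned value.

```python
import math

P = {0: 0.3, 1: 0.7}   # hypothetical source distribution
q = {0: 0.8, 1: 0.2}   # q[x]: probability under Q that the reproduction differs from x

def log_mgf(x, beta):
    """ln sum over xhat of Q(xhat) * exp(-beta * d(x, xhat)) for Hamming distortion."""
    return math.log(1.0 - q[x] + q[x] * math.exp(-beta))

def golden_min(f, lo, hi, iters=100):
    """Golden-section search for the minimizer of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(iters):
        if f(c) < f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

def joint(delta):
    """min over beta >= 0 of [beta*delta + sum_x P(x) log_mgf(x, beta)]; returns (value, beta)."""
    f = lambda beta: beta * delta + sum(P[x] * log_mgf(x, beta) for x in P)
    beta = golden_min(f, 0.0, 50.0)
    return f(beta), beta

def per_letter(x, delta_x):
    """min over beta_x >= 0 of [beta_x*delta_x + log_mgf(x, beta_x)]; returns (value, beta_x)."""
    f = lambda beta: beta * delta_x + log_mgf(x, beta)
    beta = golden_min(f, 0.0, 50.0)
    return f(beta), beta

def separated(delta):
    """Maximize over allocations with P(0)*d0 + P(1)*d1 = delta; returns (value, beta_0, beta_1)."""
    def neg_total(d0):
        d1 = (delta - P[0] * d0) / P[1]
        return -(P[0] * per_letter(0, d0)[0] + P[1] * per_letter(1, d1)[0])
    d0 = golden_min(neg_total, 0.0, min(1.0, delta / P[0]))
    d1 = (delta - P[0] * d0) / P[1]
    return -neg_total(d0), per_letter(0, d0)[1], per_letter(1, d1)[1]

value_joint, beta_joint = joint(0.15)
value_sep, beta_0, beta_1 = separated(0.15)
```

In this toy case the two values coincide up to numerical tolerance, and the per-letter minimizers `beta_0` and `beta_1` agree with each other and with `beta_joint`, which is the equal-parameter condition described above.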

Another limitation of interpreting $\beta$ as temperature is that
it does not extend to two or more rare events at the same
time. For instance, the rate-distortion function $R_Q(\Delta_1, \Delta_2)$
w.r.t. two simultaneous distortion constraints, with distortion
measures $d_1$ and $d_2$, is given by the two-dimensional Legendre
transform

$$R_Q(\Delta_1, \Delta_2) = -\min_{\beta_1 \ge 0} \min_{\beta_2 \ge 0} \left[ \beta_1\Delta_1 + \beta_2\Delta_2 + \sum_{x \in \mathcal{X}} P(x) \ln \sum_{\hat{x} \in \hat{\mathcal{X}}} Q(\hat{x}) e^{-\beta_1 d_1(x,\hat{x}) - \beta_2 d_2(x,\hat{x})} \right]. \qquad (4)$$

But this does not have any apparent physical interpretation
because there is only one temperature in physics.

III. Large Deviations and Free Energy 

The main idea in this paper is that in order to give a physical
interpretation to the rate function as the Legendre transform
of the log-moment generating function, we use the Legendre
transform that relates the Helmholtz free energy to the Gibbs
free energy, $G(\beta, \lambda)$ (cf. eq. (3)), rather than the one that
relates the Helmholtz free energy to the entropy, $S(E)$. Thus,
the Chernoff variable would be the force $\lambda$ (or $s$) rather than
the inverse temperature $\beta$. Also, considering the temperature
as being fixed throughout, we can view the weights $\{Q(\hat{x})\}$ (in
the rate-distortion application) as part of the Hamiltonian $\mathcal{E}_0$,
which now may depend on the control parameter $\lambda$. This also
allows combinations of two or more large deviations events
since one may consider a system that is subjected to more
than one force, e.g., two or three components of the same force,
or a superposition of different types of forces.

Specifically, let us first compare the Helmholtz free energy
expression (3) to the rate function [5] of the simple large
deviations event $\{\sum_i y_i = nY\}$ w.r.t. some probability distribution
$P$:

$$I(Y) = \max_s \left[ sY - \lim_{n\to\infty} \frac{1}{n} \ln \left( \sum_{\boldsymbol{y}} P(\boldsymbol{y}) e^{s \sum_i y_i} \right) \right],$$

which in the case where $\{y_i\}$ are i.i.d. ($P(\boldsymbol{y}) = \prod_{i=1}^n P(y_i)$),
boils down to

$$I(Y) = \max_s \left[ sY - \ln \sum_y P(y) e^{sy} \right].$$

Fixing the temperature $T$ to some $T_0 = 1/(k\beta_0)$, taking $y = x$
and $\mathcal{E}_0(x) = \mathcal{E}_0(y) = -kT_0 \ln P(y)$, we readily see that
$I(Y)$ coincides with $F(\beta_0, Y)$ up to the multiplicative constant
factor of $kT_0$, which is immaterial. We observe then that the
large deviations rate function has a natural interpretation as
the Helmholtz free energy (in units of $kT_0$) of a system with
Hamiltonian

$$\mathcal{E}_0(y) = -kT_0 \ln P(y)$$

and temperature $T_0$. As said, the Chernoff parameter $s$ has
(again, within the factor $\beta_0$) the meaning of a driving force
that acts on the displacement variables $\{y_i\}$ (cf., e.g., the
above example of the one-dimensional harmonic oscillator,
which makes it explicit). For example, in the i.i.d. case, the
driving force $s$ required to shift the expectation of each $y_i$
(and hence also of $\frac{1}{n}\sum_i y_i$) towards $Y$ is the solution
to the equation

$$Y = \frac{\partial}{\partial s} \ln \left[ \sum_y P(y) e^{sy} \right],$$

or equivalently,

$$Y = \frac{\sum_y P(y) \cdot y e^{sy}}{\sum_y P(y) e^{sy}}.$$

The Legendre transform relation between the log-partition
function and $I(Y)$ induces a one-to-one mapping between
$Y$ and $s$, which is defined by the above equation. To emphasize
this dependency, we henceforth denote the value of $Y$
corresponding to a given $s$ by $\langle y \rangle_s$, which symbolizes the
fact that it is the expectation (see footnote no. 7) of each $y_i$, denoted generically
by $y$, w.r.t. the probability distribution $P_s = \{P_s(y)\}$, where

$$P_s(y) = \frac{P(y) e^{sy}}{\sum_{y'} P(y') e^{sy'}},$$

i.e.,

$$\langle y \rangle_s = \frac{\sum_y P(y) \cdot y e^{sy}}{\sum_y P(y) e^{sy}} = \frac{\partial}{\partial s} \ln \left[ \sum_y P(y) e^{sy} \right].$$

On substituting $\langle y \rangle_s$ instead of $Y$ in the expression defining
$I(Y)$, we can re-define the rate function as a function of (the
maximizing) $s$, i.e.,

$$\hat{I}(s) = s \langle y \rangle_s - \ln \sum_y P(y) e^{sy}.$$

Note that $\hat{I}(s)$ can be represented in an integral form as
follows:

$$\begin{aligned}
\hat{I}(s) &= \int_0^s d\hat{s} \cdot \left[ \langle y \rangle_{\hat{s}} + \hat{s} \frac{d \langle y \rangle_{\hat{s}}}{d\hat{s}} - \langle y \rangle_{\hat{s}} \right] \\
&= \int_{\langle y \rangle_0}^{\langle y \rangle_s} \hat{s} \cdot d\langle y \rangle_{\hat{s}}. \qquad (5)
\end{aligned}$$

Now observe that the integrand is a product of the force, $\hat{s}$,
and an infinitesimal displacement that it works upon, $d\langle y \rangle_{\hat{s}} =
\langle y \rangle_{\hat{s}} - \langle y \rangle_{\hat{s} - d\hat{s}}$ (which in turn is the response of the system to
a corresponding infinitesimal change in the force from $\hat{s} - d\hat{s}$
to $\hat{s}$). In physical terms, $\hat{s} \cdot d\langle y \rangle_{\hat{s}}$ is therefore an infinitesimal
contribution of the average work (in units of $kT_0$) done by the
driving force $\hat{s}$ on the displacement variables $\{y_i\}$. Thus, the
integral, $\hat{I}(s) = \int \hat{s} \cdot d\langle y \rangle_{\hat{s}}$, is the total amount of work (again,
in units of $kT_0$) carried out by the force $\hat{s}$, as it increases
from zero to $s$ during a slow process that allows the system
to equilibrate after every infinitesimally small change in $\hat{s}$.
In the language of physics, this is a reversible process, or a
quasi-static process. Using the concavity of $F$ as a function
of $s$, it is easy to show that any protocol of changing $\hat{s}$ from
$0$ to $s$, in a way that includes abrupt changes in $\hat{s}$, would
always yield an amount of work larger than or equal to $\hat{I}(s)$
(which is consistent with the operative meaning of $\hat{I}(s)$ as the
free energy of the system; see footnote no. 1). Thus, for any
sequence, $s_1, \ldots, s_\ell$, of numbers between $0$ and $s$, arranged in
increasing order with $s_0 = 0$ and $s_\ell = s$, we can
sandwich $\hat{I}(s)$ between two bounds,

$$\sum_{i=1}^{\ell} s_{i-1} \left[ \langle y \rangle_{s_i} - \langle y \rangle_{s_{i-1}} \right] \le \hat{I}(s) \le \sum_{i=1}^{\ell} s_i \left[ \langle y \rangle_{s_i} - \langle y \rangle_{s_{i-1}} \right],$$

which become tighter and tighter as the partition of the interval
$[0, s]$, defined by $\{s_i\}_{i=1}^{\ell}$, becomes more refined.

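For an alternative integral expression, one observes that
$d\langle y \rangle_s/ds = \langle y^2 \rangle_s - \langle y \rangle_s^2 = \mathrm{Var}_s\{y\}$, namely, the variance of
$y$ w.r.t. the probability distribution $P_s$. Thus,

$$\hat{I}(s) = \int_0^s \hat{s} \cdot \mathrm{Var}_{\hat{s}}\{y\}\, d\hat{s}$$

and

$$\langle y \rangle_s = \langle y \rangle_0 + \int_0^s \mathrm{Var}_{\hat{s}}\{y\}\, d\hat{s}.$$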



Note that, by the same token, in the interpretation of [9],
where the Chernoff parameter was the inverse temperature
$\beta$, which is conjugate to the Hamiltonian $\mathcal{E}$, the corresponding
integral could have been represented as $\int \beta \cdot d\langle \mathcal{E} \rangle_\beta = \int dQ/(kT)$, $Q$
being heat, which is the change of entropy along a reversible
process. The corresponding variance expressions would then
be related to the heat capacity at constant volume. In the more 
general context considered here, this is a special case of the 
fluctuation-dissipation theorem in statistical physics (cf. e.g., 
M P- 32, eq. (2.44)]). 
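
The work and variance representations of the rate function lend themselves to a direct numerical check. The sketch below is illustrative only: the three-point distribution is hypothetical, and the rate function is written as a function of the force $s$, namely $s\langle y \rangle_s - \ln \sum_y P(y)e^{sy}$. It compares this closed form with a midpoint-rule evaluation of the variance integral, and with lower and upper Riemann-Stieltjes sums of the work integral, which give one possible pair of bounds for a stepwise (non-quasi-static) protocol.

```python
import math

ys = [0.0, 1.0, 3.0]   # hypothetical values of y
P = [0.5, 0.3, 0.2]    # hypothetical probabilities P(y)

def tilted_moments(s):
    """Return (ln sum_y P(y) e^{s y}, mean, variance) under P_s(y) proportional to P(y) e^{s y}."""
    w = [p * math.exp(s * y) for p, y in zip(P, ys)]
    Z = sum(w)
    m1 = sum(wi * y for wi, y in zip(w, ys)) / Z
    m2 = sum(wi * y * y for wi, y in zip(w, ys)) / Z
    return math.log(Z), m1, m2 - m1 * m1

def rate_direct(s):
    """Rate function as a function of the force s: s * <y>_s - ln sum_y P(y) e^{s y}."""
    log_mgf, mean, _ = tilted_moments(s)
    return s * mean - log_mgf

def rate_by_variance_integral(s, steps=20000):
    """Midpoint rule for the integral of s' * Var_{s'}{y} over s' in [0, s]."""
    h = s / steps
    return h * sum((i + 0.5) * h * tilted_moments((i + 0.5) * h)[2] for i in range(steps))

def work_bounds(s, parts):
    """Lower and upper Riemann-Stieltjes sums of s' d<y>_{s'} on a uniform partition of [0, s]."""
    grid = [s * i / parts for i in range(parts + 1)]
    means = [tilted_moments(g)[1] for g in grid]
    lower = sum(grid[i] * (means[i + 1] - means[i]) for i in range(parts))
    upper = sum(grid[i + 1] * (means[i + 1] - means[i]) for i in range(parts))
    return lower, upper

s = 1.2
direct = rate_direct(s)
by_variance = rate_by_variance_integral(s)
coarse = work_bounds(s, 50)
fine = work_bounds(s, 500)
```

Under these assumptions `direct` and `by_variance` agree up to the quadrature error, `direct` lies between the two sums in `coarse`, and the sums in `fine` are closer together, consistent with the bounds tightening as the partition is refined.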

We next discuss a physical example which will be directly 
relevant for the rate-distortion problem. 

Example 2 [7, p. 134, Problem 13]: Consider a physical
system, modeled as a one-dimensional array of $n$ elements
(depicted as small springs in Fig. 2), that are arranged along
a straight line. Each element may independently be in one of
two states, A or B (e.g., in state A the element is stretched
and in state B, it is contracted, according to Fig. 2). The state
of the $i$-th element, $i = 1, 2, \ldots, n$, is labeled $\hat{x}_i \in \{A, B\}$.
When an element is at state $\hat{x}$, its length is $y_{\hat{x}}$ and its internal
energy is $\epsilon_{\hat{x}}$. A stretching force $\lambda > 0$ (or a contracting force,
if $\lambda < 0$) is applied to one edge of the array, whereas the
other edge is fixed to a wall. What is the expected (and most
probable) total length $L = nY$ of the array at temperature $T_0$?




7 In the sequel, we use $\langle \cdot \rangle_s$ to denote other moments of $y$ w.r.t. $P_s$ as well.



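Fig. 2. One-dimensional array of two-state elements.

Since the elements are independent,

$$\begin{aligned}
Z_n(\beta_0, \lambda) &= \sum_{\hat{x}_1 \in \{A,B\}} \cdots \sum_{\hat{x}_n \in \{A,B\}} \exp\left\{ -\beta_0 \sum_{i=1}^n (\epsilon_{\hat{x}_i} - \lambda y_{\hat{x}_i}) \right\} \\
&= \left[ e^{-\beta_0(\epsilon_A - \lambda y_A)} + e^{-\beta_0(\epsilon_B - \lambda y_B)} \right]^n, \qquad (6)
\end{aligned}$$

and so,

$$nG_n(\beta_0, \lambda) = -kT_0 \ln Z_n(\beta_0, \lambda) = -nkT_0 \ln \left[ e^{-\beta_0(\epsilon_A - \lambda y_A)} + e^{-\beta_0(\epsilon_B - \lambda y_B)} \right].$$

The expected length is

$$nY = -n \cdot \frac{\partial G_n(\beta_0, \lambda)}{\partial \lambda} = n \cdot \frac{y_A e^{-\beta_0(\epsilon_A - \lambda y_A)} + y_B e^{-\beta_0(\epsilon_B - \lambda y_B)}}{e^{-\beta_0(\epsilon_A - \lambda y_A)} + e^{-\beta_0(\epsilon_B - \lambda y_B)}}. \qquad (7)$$

In terms of the foregoing discussion, $s = \beta_0 \lambda$, the force scaled
by $\beta_0$, controls the expected length per element, which is

$$Y = \langle y \rangle_s = \frac{y_A e^{-\beta_0 \epsilon_A + s y_A} + y_B e^{-\beta_0 \epsilon_B + s y_B}}{e^{-\beta_0 \epsilon_A + s y_A} + e^{-\beta_0 \epsilon_B + s y_B}}.$$

The free energy per element is then

$$F(\beta_0, Y) = -kT_0 \ln \left[ e^{-\beta_0 \epsilon_A + s y_A} + e^{-\beta_0 \epsilon_B + s y_B} \right] + kT_0 sY,$$

where $s$ is related to $Y$ according to the second to last
equation, which is also the value of $s$ that maximizes the last
expression.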

Consider now two arrays as above, labeled by x 6 {a, b}, 
which consist of two different types of elements. Array x has 
n(x) elements, and as before, each element of this array may 
be in one of two states, A or B. When an element of array 
x is at state x, its length is y$\ x and its internal energy is 
££i x . The two arrays are connected together to form a larger 
system with a total of n — n(a) + n(b) elements, and this 
larger system is stretched (or shrinked) so that its edges are 
fixed at two points which are at distance nYo far apart. What 
is the contribution of each individual array to the total length, 
nY, and what is the force 'felt' by each one of them? 

Denoting p a = n(a)/n and p b = n(b)/n, the total free 
energy per element is given by 



p a F a (Po,Y a ) 
PaF a (f3o,Y b )+ Pb F b [{3 , 



p b F b ((3 Q ,Y b ) 
Y 



PaYa 



Pb 



(8) 



where F a and F b are the Helmholtz free energies per element 
(cf. above) pertaining to the two arrays, respectively, and 
Y a and Y b are their normalized lengths. At equilibrium, p a 
minimizes this expression, and the minimizing p a solves the 
equation: 



dF a (p ,Y) 



dY 



dF b (p ,Y) 



dY 



Y=Y a Y={Yo-PaY a )/pb 

But the left-hand side is $\lambda_a = kT_0 s_a$, the force felt by
array (a), and similarly, the right-hand side is $\lambda_b = kT_0 s_b$,
the force felt by array (b). The last equation tells us that in
mechanical equilibrium they are equal, which makes sense, as
otherwise the boundary point between the two arrays would keep
moving in either direction.$^8$ In other words, the equilibrium
values of $Y_a$ and $Y_b$ are adjusted in a way that

$$F_a(\beta_0, Y_a) = \max_{\lambda}\left[G_a(\beta_0, \lambda) + \lambda Y_a\right]$$

and

$$F_b(\beta_0, Y_b) = \max_{\lambda}\left[G_b(\beta_0, \lambda) + \lambda Y_b\right]$$

would be both maximized by the same value of $\lambda$ (or,
equivalently, $s$). In this situation, the same value of $\lambda$
would also achieve the maximum of the weighted sum:

$$\max_{\lambda}\left[p_a G_a(\beta_0, \lambda) + p_b G_b(\beta_0, \lambda) + \lambda Y_0\right],$$

which treats the entire system as a whole. The maximizing
value of $\lambda$ is the one that corresponds to total length $Y_0$.
This concludes Example 2. □
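The equal-force condition of Example 2 can be checked numerically. The following is a minimal sketch, not part of the original derivation: the element lengths, energies (in units of $kT_0$), the fractions $p_a$, $p_b$ and the value of $Y_0$ are hypothetical numbers chosen only for illustration, and each free energy is computed in units of $kT_0$ as $\max_s[sY - \ln Z(s)]$, with $Z(s)$ the single-element partition function in the force ensemble.

```python
import math

def log_partition(s, states):
    """ln Z(s) for one element; states is a list of (length, energy / kT0)."""
    return math.log(sum(math.exp(-e + s * y) for y, e in states))

def mean_length(s, states):
    """Average length <y>_s of one element under force parameter s."""
    weights = [math.exp(-e + s * y) for y, e in states]
    return sum(w * y for w, (y, _) in zip(weights, states)) / sum(weights)

def solve_force(target, length_fn, lo=-50.0, hi=50.0):
    """Bisection for s with length_fn(s) = target (length_fn is increasing in s)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if length_fn(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def free_energy(y, states):
    """Return (F / kT0, s), where F / kT0 = max_s [s * y - ln Z(s)]."""
    s = solve_force(y, lambda t: mean_length(t, states))
    return s * y - log_partition(s, states), s

# Hypothetical element parameters: (length, energy / kT0) for states A and B.
ARRAY_A = [(1.0, 0.0), (2.0, 0.5)]
ARRAY_B = [(0.5, 0.0), (1.5, 1.0)]
P_A, P_B, Y0 = 0.4, 0.6, 1.2

def total_free_energy(y_a):
    """Eq. (8) in units of kT0, with Y_b fixed by the total-length constraint."""
    y_b = (Y0 - P_A * y_a) / P_B
    return P_A * free_energy(y_a, ARRAY_A)[0] + P_B * free_energy(y_b, ARRAY_B)[0]

def equilibrium():
    """Common force parameter s and the lengths (Y_a, Y_b) it induces."""
    s = solve_force(Y0, lambda t: P_A * mean_length(t, ARRAY_A)
                    + P_B * mean_length(t, ARRAY_B))
    return s, mean_length(s, ARRAY_A), mean_length(s, ARRAY_B)
```

Calling `equilibrium()` gives a single value of $s$ for the whole system; feeding the resulting $Y_a$ and $Y_b$ back into `free_energy` should return that same $s$ for each array separately, and `total_free_energy` should not decrease when $Y_a$ is perturbed away from its equilibrium value.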

In the next section, we will see how Example 2 (especially its
second part, with two connected arrays of elements) is directly
applicable to the rate-distortion setting.

IV. Rate-Distortion 

Let us consider now the rate-distortion coding problem.
We are given a source sequence $\boldsymbol{x} = (x_1, \ldots, x_n)$ to be
compressed, whose letters $\{x_i\}$ take on values in a finite
alphabet $\mathcal{X}$ of size $K$. We assume that the source has a
given empirical distribution $P = \{P(x),\ x \in \mathcal{X}\}$ (typically,
close to the real distribution), i.e., each letter $x \in \mathcal{X}$ appears
$n(x) = nP(x)$ times in $\boldsymbol{x}$. Next consider a random selection
of a reproduction codeword $\hat{\boldsymbol{x}} = (\hat{x}_1, \ldots, \hat{x}_n)$, where each
reproduction symbol $\hat{x}_i$ is drawn i.i.d. from a distribution
$Q = \{Q(\hat{x}),\ \hat{x} \in \hat{\mathcal{X}}\}$, where $\hat{\mathcal{X}}$ is a finite reproduction
alphabet of size $J$. For most of our discussion, it will
be assumed that even if the desired distortion level varies, the
random coding distribution $Q$ is nevertheless kept fixed, for the
sake of simplicity.$^9$ It is well known that the rate-distortion
function of the source $P$, w.r.t. a given distortion measure
$d(x, \hat{x})$, is given by the rate function of the large deviations
event $\{\sum_{i=1}^{n} d(x_i, \hat{x}_i) \le n\Delta\}$.

Instead of working with the reproduction symbols as our RV's,
we will sometimes work directly with the distortions
$\{d(x_i, \hat{x}_i)\}$ incurred, which will be denoted by $\{\delta_i\}$
(playing the same role as $\{y_i\}$ thus far). Accordingly, we
define

$$Q(\delta|x) = \sum_{\{\hat{x}:\ d(x,\hat{x}) = \delta\}} Q(\hat{x}).$$

8 This is similar to the classical mechanical equilibrium between two
volumes of gas separated by a freely moving plate, which stabilizes at the
point where the pressures from both sides equalize.

9 A word of clarification is in order here: Earlier, we mentioned that the
optimum $Q$ may depend on $s$, or equivalently on $\Delta$. In the sequel, we describe
certain processes along which the distortion level varies, starting from a very
high distortion level $\Delta_0$, and ending at a given, desired distortion level, $\Delta$.
To make a statement concerning the rate-distortion function, computed at
the latter distortion level, $R(\Delta)$, we can always pick the optimum $Q$ for
this target value of $\Delta$ and keep it fixed, even when considering the
above-mentioned higher distortion levels. Thus, in these processes, for distortion
levels above $\Delta$, we will, in general, 'move' along the curve $R_Q(\cdot)$, which
is the rate-distortion function with an output distribution constrained to $Q$,
rather than the curve $R(\cdot)$. Of course, the two curves intersect at distortion
$\Delta$. The analysis can be modified to allow $Q$ to depend on $s$ along the process
(see comment no. 4 on this in Section 6).

Thus, we think of the distortion $\delta$ as a RV drawn from
a distribution $Q(\delta|x)$ indexed by the corresponding source
symbol $x$, rather than as a function of $x$ and a RV $\hat{x}$, whose
distribution $Q(\hat{x})$ does not depend on $x$. The large deviations
event under consideration is then $\{\sum_{i=1}^{n} \delta_i \le n\Delta\}$, where
$\{\delta_i\}$ are still independent, but no longer identically distributed.
For each $x \in \mathcal{X}$, $n(x) = nP(x)$ of these RV's are drawn from
$Q(\delta|x)$. The large deviations rate function, obtained when all
$\{\delta_i\}$ are handled as a whole, is given by

$$I(\Delta) = \max_{s \le 0}\left[s\Delta - \sum_{x \in \mathcal{X}} P(x) \ln\left(\sum_{\delta} Q(\delta|x) e^{s\delta}\right)\right].$$



In analogy to the results of [9] (see also Subsection 2A),
another look is the following: Consider the partial distortions,
sorted according to the underlying source symbols, i.e., for
each $x \in \mathcal{X}$, $\sum_{i:\ x_i = x} \delta_i$ is the total distortion
contributed by $x$. Clearly, the large deviations event under
discussion occurs iff there exists a distortion allocation
$\mathcal{D} = \{\Delta_x,\ x \in \mathcal{X}\}$ with $\sum_{x \in \mathcal{X}} P(x)\Delta_x \le \Delta$ such that
$\sum_{i:\ x_i = x} \delta_i \le n(x)\Delta_x$ for all $x \in \mathcal{X}$. Thus, it can be
thought of as the union (over all possible distortion allocations)
of the intersections (over $\mathcal{X}$) of the independent events
$\{\sum_{i:\ x_i = x} \delta_i \le n(x)\Delta_x\}$. Since the effective number of
distortion allocations is polynomial in $n$, the probability is
dominated by the worst allocation, which yields

$$\hat{I}(\Delta) = \min_{\{\mathcal{D}:\ \sum_{x} P(x)\Delta_x \le \Delta\}} \sum_{x \in \mathcal{X}} P(x) \cdot \max_{s_x \le 0}\left[s_x \Delta_x - \ln\left(\sum_{\delta} Q(\delta|x) e^{s_x \delta}\right)\right]. \tag{9}$$



We argue that $\hat{I}(\Delta) = I(\Delta)$ and hence both coincide with
the rate-distortion function $R_Q(\Delta)$ w.r.t. the random coding
distribution $Q$.

Before we prove it formally, we comment that the intuition
comes from interpreting the expressions of the rate functions in
the framework of the above example of stretching/contracting
concatenated one-dimensional arrays of elements. Here, we
have $|\mathcal{X}| = K$ different arrays at temperature $T_0$, concatenated
together to form one larger system with a total of $n$ elements.
Each individual array is labeled by $x \in \mathcal{X}$ and it contains
$n(x) = nP(x)$ elements. Each such element may be in one
of $J$ states, labeled by $\hat{x} \in \hat{\mathcal{X}}$. The 'length' and the internal
energy of an element of array $x$ at state $\hat{x}$ are
$\delta_{\hat{x}|x} = d(x, \hat{x})$ and $\epsilon_{\hat{x}|x} = -kT_0 \ln Q(\hat{x})$ (independent of $x$),
respectively. Upon identifying this mapping between the
rate-distortion problem and the physical example, we immediately
see that their mathematical formalisms, and hence also their
properties, are precisely the same. Indeed, the expression of
$I(\Delta)$ is the Helmholtz free energy (in units of $kT_0$) per element
(pertaining to the entire system as a whole) when the total
length is shrunk to $n\Delta$. On the other hand, the expression of
$\hat{I}(\Delta)$ describes the minimum Helmholtz free energy (again, in
units of $kT_0$) across all partial length allocations
$\{n(x)\Delta_x\}_{x \in \mathcal{X}}$ that comply with a total length not exceeding
$n\Delta$. But this minimum free energy is achieved when all
individual arrays 'feel' the same force, i.e., the same value of
$s_x$. Hence, the two expressions should coincide. This means,
among other things, that the typical relative contribution of
each source symbol $x$ to the distortion behaves exactly like the
relative lengths of the individual arrays when they lie in
mechanical equilibrium.

Formally, the following proof is similar to that of [9,
Theorem 1], but for completeness, we provide it here too. We
first prove that $\hat{I}(\Delta) \ge I(\Delta)$ and then the reversed inequality.



1(A) = min N P(x) ■ max \s x A x 



mhTQ(5|x)e 



mm 

{V: E.gAr P{x)A x <A 



max [s x P(x)A x - 



xex 
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P(x)hx f ^2,Q(5\x)e 
> min max > \sP(x)A x 



xex 



P(x)ln ^Q(5|a 



> min max 

{V: P^A^A} s<0 



A ^( a 



x£X 



x£X \ S / . 

> min max \sA- 



^P(x)ln [J2Q(S\x)e 



x£X 



= max 

s<0 



sA- P(x)ba. l^2Q{5\x)e s 

x<=X \ 8 



(10) 



where we have used the fact that the sum of maxima cannot
be smaller than the maximum of a sum, as well as the fact
that the optimum $s$ is to be sought in the range $s \le 0$, and so,
$\sum_{x \in \mathcal{X}} P(x)\Delta_x \le \Delta$ implies $s \sum_{x \in \mathcal{X}} P(x)\Delta_x \ge s\Delta$.

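In the other direction, let $s^*$ be the achiever of $I(\Delta)$,
namely, the solution $s$ to the equation

$$\Delta = \sum_{x \in \mathcal{X}} P(x) \cdot \frac{\partial}{\partial s} \ln\left(\sum_{\delta} Q(\delta|x) e^{s\delta}\right),$$

and consider the distortion allocation

$$\Delta_x^* = \frac{\partial}{\partial s} \ln\left(\sum_{\delta} Q(\delta|x) e^{s\delta}\right)\bigg|_{s = s^*},$$

which obviously complies with the overall distortion
constraint. Thus,

$$\begin{aligned}
\hat{I}(\Delta) &= \min_{\{\mathcal{D}:\ \sum_{x} P(x)\Delta_x \le \Delta\}} \sum_{x \in \mathcal{X}} P(x) \cdot \max_{s_x \le 0}\left[s_x \Delta_x - \ln\left(\sum_{\delta} Q(\delta|x) e^{s_x \delta}\right)\right] \\
&\le \sum_{x \in \mathcal{X}} P(x) \cdot \max_{s_x \le 0}\left[s_x \Delta_x^* - \ln\left(\sum_{\delta} Q(\delta|x) e^{s_x \delta}\right)\right] \\
&= \sum_{x \in \mathcal{X}} P(x) \cdot \left[s^* \Delta_x^* - \ln\left(\sum_{\delta} Q(\delta|x) e^{s^* \delta}\right)\right] \\
&= s^* \Delta - \sum_{x \in \mathcal{X}} P(x) \ln\left(\sum_{\delta} Q(\delta|x) e^{s^* \delta}\right) = I(\Delta).
\end{aligned} \tag{11}$$

Here, the second equality holds since, by the definition of
$\Delta_x^*$, each inner maximum is attained at $s_x = s^*$.
This completes the proof that $\hat{I}(\Delta) = I(\Delta)$. □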
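The coincidence of the two rate functions can also be observed numerically. The sketch below is an illustration under stated assumptions, not part of the proof: the binary source $P$, the coding distribution $Q$, the Hamming distortion and the target level $\Delta = 0.2$ are hypothetical choices. The minimization over allocations is replaced by a grid search restricted to allocations that meet the constraint with equality, which loses nothing here because each per-letter rate function is non-increasing in $\Delta_x$.

```python
import math

P = [0.3, 0.7]            # hypothetical source distribution P(x)
Q = [0.4, 0.6]            # hypothetical random coding distribution Q(x_hat)
D = [[0.0, 1.0],          # Hamming distortion d(x, x_hat)
     [1.0, 0.0]]

def log_z(s, x):
    """ln sum_{x_hat} Q(x_hat) exp(s * d(x, x_hat))."""
    return math.log(sum(q * math.exp(s * d) for q, d in zip(Q, D[x])))

def mean_dist(s, x):
    """Tilted mean distortion <delta>_{s|x}."""
    w = [q * math.exp(s * d) for q, d in zip(Q, D[x])]
    return sum(wi * d for wi, d in zip(w, D[x])) / sum(w)

def solve_s(target, fn, lo=-60.0, hi=0.0):
    """Bisection on s <= 0 for fn(s) = target (fn increasing); clamps at the ends."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if fn(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rate_joint(delta):
    """I(Delta): all distortions handled as a whole, with a single parameter s."""
    s = solve_s(delta, lambda t: sum(p * mean_dist(t, x) for x, p in enumerate(P)))
    return s * delta - sum(p * log_z(s, x) for x, p in enumerate(P)), s

def rate_single(delta_x, x):
    """max over s_x <= 0 of [s_x * Delta_x - ln Z_x(s_x)] for one source letter."""
    s = solve_s(delta_x, lambda t: mean_dist(t, x))
    return s * delta_x - log_z(s, x)

def rate_allocation(delta, grid=400):
    """I_hat(Delta) by grid search over allocations with sum_x P(x) Delta_x = Delta."""
    best = float("inf")
    for k in range(1, grid):
        d0 = (delta / P[0]) * k / grid
        d1 = (delta - P[0] * d0) / P[1]
        best = min(best, P[0] * rate_single(d0, 0) + P[1] * rate_single(d1, 1))
    return best
```

Comparing `rate_joint(0.2)[0]` with `rate_allocation(0.2)` should show the grid-search value approaching the single-parameter value from above as the grid is refined.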



Comment: As noted in [9], our discussion in this section, as
well as in the next section, applies to channel capacity too,
provided that $P = \{P(x)\}$ is understood as the channel output
distribution, $Q = \{Q(\hat{x})\}$ is the random (channel) coding
distribution, the distortion measure is taken to be
$d(x, \hat{x}) = -\ln W(x|\hat{x})$, where $W$ is the transition probability
matrix associated with the memoryless channel, and the
"distortion level" is set to
$\Delta = -\sum_{x, \hat{x}} Q(\hat{x}) W(x|\hat{x}) \ln W(x|\hat{x})$. In this
case, the maximizing $s$ is always $s^* = -1$.

V. Integral Representations 

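In view of the observations made in Section 3, it is
interesting to represent the rate-distortion function as mechanical
work carried out on the distortion variable along a reversible
process, as well as in terms of the integrated variance of the
distortion:

$$\begin{aligned}
R_Q(\Delta) &= \sum_{x \in \mathcal{X}} P(x) \int_{\langle\delta\rangle_{0|x}}^{\langle\delta\rangle_{s|x}} \hat{s} \cdot d\langle\delta\rangle_{\hat{s}|x} \\
&= \sum_{x \in \mathcal{X}} P(x) \cdot \int_0^s d\hat{s} \cdot \hat{s} \cdot \mathrm{Var}_{\hat{s}|x}\{\delta\},
\end{aligned} \tag{12}$$

where $s$ is related to $\Delta$ via the relation

$$\sum_{x \in \mathcal{X}} P(x) \langle\delta\rangle_{s|x} = \Delta,$$

and where $\langle\delta\rangle_{s|x}$ and $\mathrm{Var}_{s|x}\{\delta\}$ are defined in the spirit of
the earlier definitions of $\langle y \rangle_s$ and $\mathrm{Var}_s\{y\}$, except that $y$ is
replaced by $\delta$ and $P_s$ now includes conditioning on $x$, i.e.,

$$\langle\delta\rangle_{s|x} = \frac{\sum_{\delta} \delta\, Q(\delta|x) e^{s\delta}}{\sum_{\delta} Q(\delta|x) e^{s\delta}}$$

and

$$\mathrm{Var}_{s|x}\{\delta\} = \frac{\sum_{\delta} (\delta - \langle\delta\rangle_{s|x})^2 Q(\delta|x) e^{s\delta}}{\sum_{\delta} Q(\delta|x) e^{s\delta}} = \frac{\sum_{\delta} \delta^2 Q(\delta|x) e^{s\delta}}{\sum_{\delta} Q(\delta|x) e^{s\delta}} - \langle\delta\rangle_{s|x}^2. \tag{13}$$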



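Upper and lower bounds can be obtained from

$$\sum_{x \in \mathcal{X}} P(x) \cdot \sum_{i=1}^{\ell-1} s_i \left(\langle\delta\rangle_{s_{i+1}|x} - \langle\delta\rangle_{s_i|x}\right) \le R_Q(\Delta) \le \sum_{x \in \mathcal{X}} P(x) \cdot \sum_{i=1}^{\ell-1} s_{i+1} \left(\langle\delta\rangle_{s_{i+1}|x} - \langle\delta\rangle_{s_i|x}\right). \tag{14}$$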
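As a small numerical illustration of such bounds, the sketch below evaluates the two sums on a uniform grid and compares them with the directly computed rate function. It rests on assumptions that are not taken from the text: the grid is taken to be $0 = s_1 > s_2 > \ldots > s_\ell = s$ with equal spacing, and $P$, $Q$ and the Hamming distortion are hypothetical choices.

```python
import math

P = [0.3, 0.7]            # hypothetical source distribution P(x)
Q = [0.4, 0.6]            # hypothetical random coding distribution Q(x_hat)
D = [[0.0, 1.0],          # Hamming distortion d(x, x_hat)
     [1.0, 0.0]]

def log_z(s, x):
    """ln sum_{x_hat} Q(x_hat) exp(s * d(x, x_hat))."""
    return math.log(sum(q * math.exp(s * d) for q, d in zip(Q, D[x])))

def tilted_mean(s, x):
    """Tilted mean distortion <delta>_{s|x}."""
    w = [q * math.exp(s * d) for q, d in zip(Q, D[x])]
    return sum(wi * d for wi, d in zip(w, D[x])) / sum(w)

def rate(s):
    """R_Q at the distortion level induced by s: s * Delta_s - sum_x P(x) ln Z_x(s)."""
    delta = sum(p * tilted_mean(s, x) for x, p in enumerate(P))
    return s * delta - sum(p * log_z(s, x) for x, p in enumerate(P))

def riemann_bounds(s, num_points):
    """Lower and upper sums on the assumed uniform grid 0 = s_1 > ... > s_l = s."""
    grid = [s * i / (num_points - 1) for i in range(num_points)]
    lower = upper = 0.0
    for x, p in enumerate(P):
        for s_i, s_next in zip(grid, grid[1:]):
            # Increment of the tilted mean; non-positive when s decreases.
            step = tilted_mean(s_next, x) - tilted_mean(s_i, x)
            lower += p * s_i * step
            upper += p * s_next * step
    return lower, upper
```

For a negative `s`, `riemann_bounds(s, num_points)` should bracket `rate(s)`, with the gap between the two sums shrinking as `num_points` grows.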

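The integrated variance formula above can also be represented
as

$$R_Q(\Delta_s) = \int_0^s d\hat{s} \cdot \hat{s} \cdot \sum_{x \in \mathcal{X}} P(x) \cdot \mathrm{Var}_{\hat{s}|x}\{\delta\} = \int_0^s d\hat{s} \cdot \hat{s} \cdot \mathrm{mmse}(\hat{s}),$$

where $\mathrm{mmse}(s)$ is the minimum mean squared error (MMSE)
in estimating the RV $\delta$ based on $x$, when they are jointly
distributed according to $P_s(x, \delta) = P(x) P_s(\delta|x)$, with $P_s(\delta|x)$
being defined as

$$P_s(\delta|x) = \frac{Q(\delta|x) e^{s\delta}}{\sum_{\delta'} Q(\delta'|x) e^{s\delta'}}.$$

At the same time, the distortion itself, $\langle\delta\rangle_s$, which we also
denote by $\Delta$, can be represented using similar integrals, but
without the factor $s$ in the integrand:

$$\begin{aligned}
\Delta = \langle\delta\rangle_s &= \sum_{x \in \mathcal{X}} P(x) \cdot \left[\langle\delta\rangle_{0|x} + \int_0^s d\hat{s} \cdot \mathrm{Var}_{\hat{s}|x}\{\delta\}\right] \\
&= \Delta_0 + \int_0^s d\hat{s} \cdot \mathrm{mmse}(\hat{s}).
\end{aligned}$$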
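Both integral representations are easy to check against the direct expressions. The following sketch does so with a composite Simpson rule; the distributions $P$ and $Q$, the Hamming distortion and the helper names are hypothetical choices made only for illustration.

```python
import math

P = [0.3, 0.7]            # hypothetical source distribution P(x)
Q = [0.4, 0.6]            # hypothetical random coding distribution Q(x_hat)
D = [[0.0, 1.0],          # Hamming distortion d(x, x_hat)
     [1.0, 0.0]]

def tilted_moments(s, x):
    """Mean, variance of delta under P_s(.|x), and ln of the normalizer."""
    w = [q * math.exp(s * d) for q, d in zip(Q, D[x])]
    z = sum(w)
    mean = sum(wi * d for wi, d in zip(w, D[x])) / z
    var = sum(wi * (d - mean) ** 2 for wi, d in zip(w, D[x])) / z
    return mean, var, math.log(z)

def mmse(s):
    """sum_x P(x) Var_{s|x}{delta}: the MMSE of estimating delta from x."""
    return sum(p * tilted_moments(s, x)[1] for x, p in enumerate(P))

def simpson(f, a, b, n=400):
    """Composite Simpson rule with n (even) sub-intervals; b < a is allowed."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

def direct(s):
    """(Delta_s, R_Q(Delta_s)) computed from the definitions."""
    delta = sum(p * tilted_moments(s, x)[0] for x, p in enumerate(P))
    log_zs = sum(p * tilted_moments(s, x)[2] for x, p in enumerate(P))
    return delta, s * delta - log_zs

def via_integrals(s):
    """(Delta_s, R_Q(Delta_s)) computed from the integrated-MMSE formulas."""
    delta0 = direct(0.0)[0]
    delta = delta0 + simpson(mmse, 0.0, s)
    rate = simpson(lambda t: t * mmse(t), 0.0, s)
    return delta, rate
```

For any `s <= 0`, `direct(s)` and `via_integrals(s)` should agree up to the quadrature error.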

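Example 3. Consider the binary symmetric source (BSS) and
the Hamming distortion measure. In this case, the optimum $Q$
is also symmetric. Here $\delta$ is a binary RV with
$\Pr\{\delta = 1|x\} = e^s/(1 + e^s)$ independently of $x$. Thus, the MMSE
estimator of $\delta$ based on $x$ is

$$\hat{\delta} = \frac{e^s}{1 + e^s},$$

regardless of $x$, and so the resulting MMSE is easily found to
be

$$\mathrm{mmse}(s) = \frac{e^s}{(1 + e^s)^2}. \tag{15}$$

Accordingly,

$$\Delta = \frac{1}{2} + \int_0^s \frac{e^{\hat{s}}\, d\hat{s}}{(1 + e^{\hat{s}})^2} = \frac{e^s}{1 + e^s}$$

and

$$\begin{aligned}
R(\Delta) &= \int_0^s \frac{\hat{s}\, e^{\hat{s}}\, d\hat{s}}{(1 + e^{\hat{s}})^2} \\
&= \ln 2 + \frac{s e^s}{1 + e^s} - \ln(1 + e^s) \\
&= \ln 2 - h_2\left(\frac{e^s}{1 + e^s}\right) \\
&= \ln 2 - h_2(\Delta),
\end{aligned} \tag{16}$$

where $h_2(u) = -u \ln u - (1 - u) \ln(1 - u)$ is the binary entropy
function. This concludes Example 3. □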
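For readers who wish to verify the closed form in Example 3, one possible route (a worked sketch, not necessarily the author's own derivation) is integration by parts, using the fact that $-1/(1+e^{\hat{s}})$ is an antiderivative of $e^{\hat{s}}/(1+e^{\hat{s}})^2$, followed by the substitution $\Delta = e^s/(1+e^s)$:

```latex
\begin{align*}
\int_0^s \frac{\hat{s}\,e^{\hat{s}}}{(1+e^{\hat{s}})^2}\,d\hat{s}
  &= \left[-\frac{\hat{s}}{1+e^{\hat{s}}}\right]_0^s
     + \int_0^s \frac{d\hat{s}}{1+e^{\hat{s}}} \\
  &= -\frac{s}{1+e^{s}}
     + \left[\hat{s} - \ln\left(1+e^{\hat{s}}\right)\right]_0^s \\
  &= \ln 2 + \frac{s\,e^{s}}{1+e^{s}} - \ln\left(1+e^{s}\right). \\
\intertext{With $\Delta = e^s/(1+e^s)$, i.e., $s = \ln\frac{\Delta}{1-\Delta}$
and $\ln(1+e^s) = -\ln(1-\Delta)$:}
s\Delta - \ln\left(1+e^{s}\right)
  &= \Delta\ln\Delta - \Delta\ln(1-\Delta) + \ln(1-\Delta) \\
  &= \Delta\ln\Delta + (1-\Delta)\ln(1-\Delta) = -h_2(\Delta).
\end{align*}
```

Combining the two parts gives $\ln 2 - h_2(\Delta)$, the familiar rate-distortion function of the BSS (in nats).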

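The integrated variance expression can be generalized as
follows: Let $\theta = t(x, \hat{x})$ be a given function of $x$ and $\hat{x}$
and let $\langle\theta\rangle_s$ denote the expectation of $t(x, \hat{x})$ w.r.t. the joint
distribution of $x$ and $\hat{x}$ defined by

$$P_s(x, \hat{x}) = \frac{P(x) Q(\hat{x}) e^{s d(x, \hat{x})}}{\sum_{\hat{x}'} Q(\hat{x}') e^{s d(x, \hat{x}')}}.$$

This characterizes the expected (and typical) value of
$\frac{1}{n}\sum_{i=1}^{n} t(x_i, \hat{x}_i)$, where $\hat{\boldsymbol{x}} = (\hat{x}_1, \ldots, \hat{x}_n)$ continues to be the
codeword that encodes $\boldsymbol{x}$ from a rate-distortion code designed
and operated with the metric $d$.$^{10}$ Then,

$$\langle\theta\rangle_s = \langle\theta\rangle_0 + \int_0^s d\hat{s} \cdot \sum_{x \in \mathcal{X}} P(x) \cdot \mathrm{Cov}_{\hat{s}|x}\{\theta, \delta\},$$

where $\mathrm{Cov}_{s|x}\{\theta, \delta\}$ is the covariance between $\theta = t(x, \hat{x})$ and
$\delta = d(x, \hat{x})$, induced by

$$Q_s(\hat{x}|x) = \frac{Q(\hat{x}) e^{s d(x, \hat{x})}}{\sum_{\hat{x}'} Q(\hat{x}') e^{s d(x, \hat{x}')}}$$

for fixed $x$. This integral form is a somewhat more general
version of the fluctuation-dissipation theorem, mentioned
above.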
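The covariance form can be checked in the same numerical manner. In the sketch below, the table `T` for $t(x, \hat{x})$, the distributions $P$ and $Q$, and the Hamming distortion are all hypothetical values chosen for illustration; the integral is evaluated with a composite Simpson rule.

```python
import math

P = [0.3, 0.7]            # hypothetical source distribution P(x)
Q = [0.4, 0.6]            # hypothetical random coding distribution Q(x_hat)
D = [[0.0, 1.0],          # Hamming distortion d(x, x_hat)
     [1.0, 0.0]]
T = [[0.5, 2.0],          # hypothetical second function t(x, x_hat)
     [1.0, 3.0]]

def tilted(s, x):
    """Q_s(x_hat | x), proportional to Q(x_hat) exp(s * d(x, x_hat))."""
    w = [q * math.exp(s * d) for q, d in zip(Q, D[x])]
    z = sum(w)
    return [wi / z for wi in w]

def mean_theta(s):
    """<theta>_s = sum_x P(x) sum_{x_hat} Q_s(x_hat | x) t(x, x_hat)."""
    return sum(p * sum(qs * t for qs, t in zip(tilted(s, x), T[x]))
               for x, p in enumerate(P))

def avg_cov(s):
    """sum_x P(x) Cov_{s|x}{theta, delta}."""
    total = 0.0
    for x, p in enumerate(P):
        qs = tilted(s, x)
        m_t = sum(q * t for q, t in zip(qs, T[x]))
        m_d = sum(q * d for q, d in zip(qs, D[x]))
        total += p * sum(q * (t - m_t) * (d - m_d)
                         for q, t, d in zip(qs, T[x], D[x]))
    return total

def simpson(f, a, b, n=400):
    """Composite Simpson rule with n (even) sub-intervals; b < a is allowed."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3
```

For a given `s`, the quantity `mean_theta(0.0) + simpson(avg_cov, 0.0, s)` should match `mean_theta(s)` up to the quadrature error.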

VI. Summary and Conclusion 

In this work, we have proposed another look at large 
deviations rate functions (or Chernoff functions), where 
the Chernoff parameter is viewed as 'force' rather than as 
temperature. This leads to the interpretation of fundamental 
quantities in information theory, like the rate-distortion 
function and channel capacity, as free energies of certain 
physical systems. This interpretation has the following 
advantages relative to the one proposed in [9]:

1) As explained in Subsection 2B, there is no need to interpret 
random coding distributions as degeneracy. 

2) As a consequence of 1), we are able to construct an 
example of a physical system whose behavior is analogous to 
that of the rate-distortion coding problem. The properties of 
this system were described in the second to the last paragraph 
of the Introduction. 

3) This interpretation generalizes to rate functions of
combinations of rare events. In this case, the rate function
involves several Chernoff variables (one per event),
which may correspond to a system with several forces, each
one acting on its own variable (cf. $R(\Delta_1, \Delta_2)$ in Subsection
2B). Our earlier physical example of a one-dimensional array
can now be extended to two dimensions, where the elements
are arranged in a rectangular lattice, and each element has
both a length and a width associated with each state. The
sum $[s_1 \sum_i d_1(x_i, \hat{x}_i) + s_2 \sum_i d_2(x_i, \hat{x}_i)]$ can be viewed as
the inner product between a two-dimensional force vector
and a two-dimensional displacement vector. Alternatively,
$s_1$ and $s_2$ may designate two different types of forces (e.g.,
a mechanical force and a magnetic force). Either way, our
derivations extend quite straightforwardly to this setting.

10 As motivating examples, consider the case where $t$ is another distortion
measure: although the codebook is designed and operated relative to the
metric $d$, its performance can also be judged relative to an additional metric
$t$. If $t(x, \hat{x})$ depends on $\hat{x}$ only, it may serve as a transmission power function
(in joint source-channel coding) or it can be the length function $\ell(\hat{x})$
(in bits) of lossless compression for the individual reproduction symbols.

4) As mentioned before, we assumed throughout the derivation
that the random coding distribution is fixed, independently
of the distortion level, that is, independently of $s$. This is
why we described $R(\Delta)$ as a process along the curve $R_Q(\cdot)$
with the understanding that $Q$ is chosen to be optimum
for the target distortion $\Delta$. One can modify the analysis to
correspond to a process along $R(\cdot)$. As mentioned earlier,
however, in most cases, the optimum $Q$ depends on $s$, and
this dependency requires correction terms that depend on
the expected values of some derivatives of $\ln Q(\hat{x})$ w.r.t. $s$.
In the analogous physical interpretation proposed here, $s$
continues to be an external control parameter that affects the
Hamiltonian. The dependence of the Hamiltonian on $s$ would
now be non-linear, but this may still be physically relevant.

5) This interpretation as free energy opens the door to new 
points of view on the rate-distortion function, e.g., as work 
done on the distortion variable along a slow process, or as 
integrated variance (or MMSE). 